HW 01

Author

Nathan Herling

Published

June 6, 2025

0 - Setup

[FYI]
'pacman' already installed — skipping install.
[FYI]
'dsbox' already installed — skipping GitHub install.
The packages loaded:
- tidyverse
- glue
- scales
- lubridate
- patchwork
- ggh4x
- ggrepel
- openintro

1 - Edinburgh Traffic

Question 1


Recreate the plot and interpret in context of the data.

This plot visualizes the distribution of road accidents at different times of day, separated into weekdays and weekends. The data is faceted by whether the day is a weekend, making it easy to compare patterns across these two categories. Accident severity is color-coded to distinguish between fatal, serious, and slight incidents. The plot highlights time-based trends that may help identify peak periods of high-risk activity. This information could support efforts to improve traffic safety or allocate emergency response resources more effectively.

One striking trend is the visible presence of fatal accidents on weekdays, in contrast to the weekend plot, which shows no visible fatalities.
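A minimal sketch of how a plot like this might be built, assuming the original is a faceted density plot and that the `accidents` data from dsbox has columns `day_of_week`, `time`, and `severity` (the column names and the `day_type` helper column are assumptions, not taken from the assignment). Because density curves show when accidents happen rather than how many there are, the sketch also tabulates counts per panel.

```r
library(dplyr)
library(ggplot2)
library(dsbox)  # assumed source of the `accidents` data

# Assumption: `accidents` has columns `day_of_week`, `time`, and `severity`.
accidents_typed <- accidents |>
  mutate(day_type = if_else(day_of_week %in% c("Saturday", "Sunday"),
                            "Weekend", "Weekday"))

# Counts per panel and severity, as a companion to the density curves.
severity_counts <- count(accidents_typed, day_type, severity)
severity_counts

# One density curve per severity level, one panel per day type.
p_accidents <- ggplot(accidents_typed, aes(x = time, fill = severity)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~day_type, ncol = 1) +
  labs(x = "Time of day", y = "Density", fill = "Severity")
p_accidents
```

The count table gives a direct check on whether the weekend panel really contains no fatal accidents or whether they are simply too few to see.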

2 - NYC marathon winners

Question 2a


What features of the distribution are apparent in the histogram and not the box plot? What features are apparent in the box plot but not in the histogram?

The conventional reading is that the histogram highlights a bimodal distribution, revealing differences in marathon times between men and women. It effectively shows the shape of the data and how values cluster across different ranges. In contrast, the box plot does not capture modality but provides a summary of central tendency, variability, and outliers. While histograms emphasize the shape of the distribution, box plots offer a concise overview of data spread.

However, the bimodal distribution pattern is still visible in the jittered data points of the box plot. That said, the histogram creates a more striking visual comparison among the data groupings, and their relative counts, especially when contrasted with the more abstract summary offered by the box plot.
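The comparison above can be sketched as follows, assuming the `nyc_marathon` data from openintro stores finish times in hours in a column named `time_hrs` (an assumption here); the bin width is an illustrative choice.

```r
library(ggplot2)
library(patchwork)
library(openintro)

# Assumption: finish times in hours are stored in `time_hrs`.
p_hist <- ggplot(nyc_marathon, aes(x = time_hrs)) +
  geom_histogram(binwidth = 0.05, na.rm = TRUE) +  # bin width is illustrative
  labs(x = "Finish time (hours)", y = "Count")

p_box <- ggplot(nyc_marathon, aes(x = time_hrs)) +
  geom_boxplot(na.rm = TRUE) +
  labs(x = "Finish time (hours)")

# Stack the two views so the same x-axis can be read across both.
p_both <- p_hist / p_box
p_both
```

Stacking the two plots makes it easier to see which histogram features (the two peaks) the box plot collapses into a single box and whiskers.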

Question 2b


Based on the plots you made, compare the distribution of marathon times for men and women.

Box plots are a great way to visualize distributions. Here, we see that the men’s and women’s time distributions overlap only slightly, and only when some of the men’s outliers are considered. Plotting either group alone would not reveal the bimodal distribution; plotting them together, however, shows a distinctly bimodal distribution in which the two groups are largely separated.

Question 2c


What information in the above plot is redundant? Redo the plot avoiding this redundancy. How does this update change the data-to-ink ratio?

In many cases, redundancy is subjective: what one person considers ‘redundancy’ might be seen as a ‘feature’ by another. Here, we can consolidate the data distributions onto a single plot, as they largely do not overlap. This approach eliminates the need to label both graphs separately. Furthermore, to leverage the visual tendency to perceive relative color differences, one of the groups can be changed to gray [cornsilk4]. By keeping the data the same and reducing the amount of ink used, we have effectively increased the data-to-ink ratio, which is the desired effect.
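One way the consolidated plot might look, as a sketch only. It assumes `division` takes the values "Men" and "Women"; which group is grayed out and the second color (`steelblue`) are illustrative choices rather than the ones used in the homework.

```r
library(ggplot2)
library(openintro)

# Assumptions: `division` is "Men"/"Women"; the color assignment is illustrative.
division_colors <- c(Men = "cornsilk4", Women = "steelblue")

p_single <- ggplot(nyc_marathon,
                   aes(x = time_hrs, y = division, fill = division)) +
  # The y-axis already names each group, so the legend is dropped.
  geom_boxplot(na.rm = TRUE, show.legend = FALSE) +
  scale_fill_manual(values = division_colors) +
  labs(x = "Finish time (hours)", y = NULL)
p_single
```

Dropping the legend removes ink that repeats what the axis labels already say, which is the sense in which the data-to-ink ratio goes up.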

Question 2d


Visualize the marathon times of men and women over the years. As is usual with time series plot, year should go on the x-axis. Use different colors and shapes to represent the times for men and women. Make sure your colors match those in the previous part. Once you have your plot, describe what is visible in this plot but not in the others.

This plot clearly shows how the distribution of finish times has changed since 1970. It also highlights that most of the outliers for both gender groups occur before 1980. The bimodal distribution of the two groups is clearly visible, along with the minimal overlap between them. The general separation in mean finish times is evident in the vertical gap between the two data clusters.
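A hedged sketch of the time series plot described here, assuming the columns `year`, `time_hrs`, and `division` in `nyc_marathon`; the color values are placeholders standing in for whatever was used in the previous part.

```r
library(ggplot2)
library(openintro)

# Placeholder colors; in practice these should match the previous part.
division_colors <- c(Men = "cornsilk4", Women = "steelblue")

p_years <- ggplot(nyc_marathon,
                  aes(x = year, y = time_hrs,
                      color = division, shape = division)) +
  geom_point(na.rm = TRUE) +
  scale_color_manual(values = division_colors) +
  labs(x = "Year", y = "Finish time (hours)",
       color = "Division", shape = "Division")
p_years
```

Giving the color and shape legends the same title lets ggplot2 merge them into a single legend.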

3 - US counties

Question 3a


a. What does the following code do? Does it work? Does it make sense? Why/why not?

This code attempts to overlay two plots with different dimensions on one another.

ggplot(county) +
  geom_point(aes(x = median_edu, y = median_hh_income)) +
  geom_boxplot(aes(x = smoking_ban, y = pop2017))

The first layer creates a scatter plot with median_edu on the x-axis and median_hh_income on the y-axis:

ggplot(county) +
  geom_point(aes(x = median_edu, y = median_hh_income))

The second layer adds a boxplot with smoking_ban (likely categorical) on the x-axis and pop2017 on the y-axis:

ggplot(county) +
  geom_boxplot(aes(x = smoking_ban, y = pop2017))

Both geom_point and geom_boxplot are layered in the same ggplot, but they rely on different x and y variables. On their own, each layer would produce a meaningful plot, but combined, they result in a confusing and misleading visualization—a kind of visual cacophony.

Technically, the code may run without error, but it doesn’t “work” from a data visualization standpoint. The two layers are forced to share a single pair of axes even though they map unrelated variables: education levels and smoking-ban categories end up on the same x-axis, while household income and population share the y-axis. The result is hard to interpret and potentially misleading.

Conclusion: The code may run without errors, but it doesn’t produce a meaningful visualization. Combining unrelated aesthetic mappings in a single plot without aligning their structure leads to a misleading and ineffective graphic.
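If the goal is to show both relationships, one possible alternative is to keep each layer in its own plot and place the two side by side. This is a sketch using patchwork (already loaded in the setup); the object names `p_income` and `p_pop` are made up for illustration.

```r
library(ggplot2)
library(patchwork)
library(openintro)

# Each plot gets its own axes, so the scales no longer collide.
p_income <- ggplot(county, aes(x = median_edu, y = median_hh_income)) +
  geom_point(na.rm = TRUE)

p_pop <- ggplot(county, aes(x = smoking_ban, y = pop2017)) +
  geom_boxplot(na.rm = TRUE)

# patchwork's `+` arranges the two plots next to each other.
p_side_by_side <- p_income + p_pop
p_side_by_side
```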

Question 3b


Which of the following two plots makes it easier to compare poverty levels (poverty) across people from different median education levels (median_edu)? What does this say about when to place a faceting variable across rows or columns?

We are to compare two graphs, each showing the same data but presented differently.

[Figures: County facets plot (left) and County facets plot (right)]

An obvious answer is the left graph: with the data plotted horizontally, it has more visual striking power. Yet, if we look at how the data is grouped in the vertical (right) graph, we see some skewing (topographical compression) of the data geometry represented in the left graph.

Therefore, from these two graphs we cannot conclude that ‘in general’ graphing data such as this horizontally will yield better visual results. To truly make a comparison the vertical graph would need to be stretched out to the same dimension/scale as the horizontal graph.
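The two layouts can be sketched as below. Putting `homeownership` on the x-axis is an assumption about the original figures; `poverty` on the y-axis and faceting by `median_edu` follow the question text.

```r
library(dplyr)
library(ggplot2)
library(openintro)

county_edu <- filter(county, !is.na(median_edu))

# Assumption: homeownership is the x variable in the original figures.
p_base <- ggplot(county_edu, aes(x = homeownership, y = poverty)) +
  geom_point(na.rm = TRUE)

# Facets in rows: panels are stacked, and each has its own stretch of y-axis.
p_rows <- p_base + facet_grid(median_edu ~ .)

# Facets in columns: all panels share one y-axis, so poverty values line up.
p_cols <- p_base + facet_grid(. ~ median_edu)

p_rows
p_cols
```

Rendering both at the same width and height (for example through the `width` and `height` arguments of `ggsave()`) would put the two layouts on a more equal footing for the comparison discussed above.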

Question 3c


Recreate the R code necessary to generate the following graphs. Note that wherever a categorical variable is used in the plot, it’s metro.

Note(s):

  • The exercise appears to want the NA(s) left in.
  • Warnings were suppressed to clean up output.
  • A ‘user alert’ was used instead, re: NA(s)
  • Use ‘zoom’ when needed (ha).
  • I could not get my trend lines to extend as far as the given example(s).

[FYI] Missing values (NA) were intentionally not filtered out prior to plotting.
This allows ggplot2 to handle them automatically, which results in some rows being dropped internally.

→ 2 row(s) contained NA in homeownership or poverty and were skipped by geom_point()
→ 3 row(s) with NA were skipped by the smoothing line (metro == ‘yes’)
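For reference, counts like the ones in the alert above could be obtained along these lines. This is a sketch and not necessarily how the alert was produced; the summary column names are made up for illustration.

```r
library(dplyr)
library(openintro)

# Count rows that ggplot2 would drop for each kind of layer.
na_summary <- county |>
  summarise(
    # Rows a point layer cannot draw: either coordinate is missing.
    missing_for_points = sum(is.na(homeownership) | is.na(poverty)),
    # Rows with no metro value, which affects layers grouped by metro.
    missing_metro = sum(is.na(metro))
  )
na_summary
```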

4 - Rental apartments in SF

Question 4a


Describe the relationship between income and credit card balance. Touch on how/if the relationship varies based on the four (4) category combinations.

The relationship between income and credit card balance is clearly increasing—higher income tends to correspond with higher balances. This positive trend holds across all categories examined. Notably, the slope of the trend lines is nearly identical when comparing non-married students to non-married non-students, and a similar pattern is seen when comparing married students to married non-students.

However, it should be noted that in all four categories the observations are concentrated toward the lower end of the income scale.

This suggests that while income strongly influences credit card balance, the effect of student or marital status on that relationship is minimal. The key takeaway is that income is the dominant factor, with student and marital status having little impact on the strength of that relationship.

Question 4b


Based on your answer to part (a), do you think married and student might be useful predictors, in addition to income for predicting credit card balance? Explain your reasoning.

Yes, with some caveats. Credit card balance is primarily influenced by income across all four categories examined. However, when married and student status are considered alongside income, it does appear possible to predict credit card balance with reasonable, though not complete, certainty.

Question 4c


Calculate credit utilization for all individuals in the credit data, and use it to recreate the following visualization.
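A minimal sketch of the calculation, assuming credit utilization is defined as balance divided by credit limit and that the data has `balance` and `limit` columns. The rows below are made-up values for illustration only, not the homework's credit data.

```r
library(dplyr)

# Made-up rows for illustration only; not the homework's credit data.
credit_toy <- data.frame(
  income  = c(20, 45, 80, 120),
  balance = c(500, 1200, 0, 900),
  limit   = c(5000, 4000, 3000, 6000),
  student = c("Yes", "No", "Yes", "No"),
  married = c("No", "No", "Yes", "Yes")
)

# Assumption: credit utilization = balance / limit.
credit_toy <- credit_toy |>
  mutate(credit_util = balance / limit)

credit_toy$credit_util
```

The new column can then be mapped to the y-axis in place of balance, with the same faceting by student and married status as in part (a).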

Question 4d

Based on the plot from part (c), how, if at all, are the relationships between income and credit utilization different than the relationships between income and credit balance for individuals with various student and marriage status.

The relationships between income and credit utilization now differ more distinctly across the four categories compared to the overall positive trends observed in part (a) with credit balance. Specifically, there is a positive relationship between income and credit utilization for both married and non-married non-students. In contrast, non-married students show a strong negative relationship, which may partly be influenced by the geometry or distribution of the dataset. For married students, the relationship is only slightly negative. Overall, these differences highlight that the patterns seen in credit utilization are not as uniformly positive as those in credit balance, and they vary more noticeably by student and marital status.

5 - Napoleon’s march

Question 5
The instructions for this exercise are simple: recreate the Napoleon’s march plot by Charles John Minard in ggplot2.
Part a
Here are a couple of sites that helped me with the code and understanding of the figure.

Part b
I added extra code comments with ‘#<—–’ in the code for this problem, to add extra description(s)/proof of knowledge of code functionality.
Part c
For my individualization, I forced the city names not to overlap and changed their color to a hue that was legible on both white and brown backgrounds [#00CED1].
Preventing the overlap meant that not all of the names fit in the graph window. This was solved with a call to: ggrepel::geom_text_repel(..)
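A small sketch of the labeling idea, not the code used for the actual figure. The data frame `cities_toy` and its coordinates are illustrative stand-ins, and `max.overlaps = Inf` is one assumed way of keeping every label on the plot.

```r
library(ggplot2)
library(ggrepel)

# Illustrative stand-in for the city data; coordinates are approximate.
cities_toy <- data.frame(
  long = c(24.0, 25.3, 26.4, 37.6),
  lat  = c(55.0, 54.7, 54.4, 55.8),
  city = c("Kowno", "Wilna", "Smorgoni", "Moscou")
)

p_cities <- ggplot(cities_toy, aes(x = long, y = lat)) +
  geom_point() +
  # geom_text_repel() nudges labels apart so they do not overlap.
  geom_text_repel(aes(label = city), color = "#00CED1", max.overlaps = Inf)
p_cities
```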